OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Interactive indexation and transcription of historical printed books

Internal identifier: 000096 (France/Analysis); previous: 000095; next: 000097

Authors: Jean-Yves Ramel [France]; Nicolas Sidère [France]

RBID: Hal:hal-01026493

Abstract

This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work lies in the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. The extracted patterns are then grouped into clusters of similar patterns so that the different letterforms of a book can be extracted and analysed to compute redundancy rates. This process significantly reduces the number of letterforms to be recognized. Once the letterforms have been clustered, a user may assign a label to each cluster using the second package, RETRO. Each label is then automatically assigned to every character in the corresponding cluster to transcribe the text of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical font recognition methods to improve OCR results on rare or specific books.
The identification of typographic materials could also be useful for studying both the aesthetic aspects (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and the economic aspects of the history of printing. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.
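
The redundancy arithmetic in the abstract (90% redundant letterforms means one label per ten characters) can be illustrated with a small sketch. This is not the AGORA or RETRO implementation: the bitmaps, the function names, and the use of exact equality in place of a real similarity measure are all assumptions made for illustration.

```python
def cluster_letterforms(glyphs):
    """Map each distinct bitmap to the positions where it occurs.

    Exact equality is a toy stand-in for a real similarity measure.
    """
    clusters = {}
    for position, bitmap in enumerate(glyphs):
        clusters.setdefault(bitmap, []).append(position)
    return clusters


def redundancy_rate(glyphs, clusters):
    """Share of occurrences that do not need a label of their own."""
    return 1 - len(clusters) / len(glyphs)


def transcribe(glyphs, clusters, labels):
    """Propagate one user-supplied label per cluster to every occurrence."""
    text = [""] * len(glyphs)
    for bitmap, positions in clusters.items():
        for position in positions:
            text[position] = labels[bitmap]
    return "".join(text)


# Hypothetical 2x2 bitmaps standing in for two letterforms.
GLYPH_A = ("#.", "##")
GLYPH_B = (".#", "##")
page = [GLYPH_A, GLYPH_B] * 10  # 20 occurrences, 2 distinct letterforms

clusters = cluster_letterforms(page)
rate = redundancy_rate(page, clusters)
labels = {GLYPH_A: "a", GLYPH_B: "b"}  # the only manual step: 2 labels
text = transcribe(page, clusters, labels)

print(f"{len(clusters)} clusters, redundancy {rate:.0%}")
print(text)
```

With 20 occurrences and 2 clusters, the rate is 1 - 2/20 = 90%, so 2 labels are enough to transcribe all 20 characters. A real system would replace the exact-match grouping with a tolerance-based comparison of letterform images.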

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Interactive indexation and transcription of historical printed books</title>
<author>
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidère">Nicolas Sidère</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-01026493</idno>
<idno type="halId">hal-01026493</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-01026493</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-01026493</idno>
<date when="2011-06-19">2011-06-19</date>
<idno type="wicri:Area/Hal/Corpus">000069</idno>
<idno type="wicri:Area/Hal/Curation">000069</idno>
<idno type="wicri:Area/Hal/Checkpoint">000085</idno>
<idno type="wicri:Area/Main/Merge">000347</idno>
<idno type="wicri:Area/Main/Curation">000342</idno>
<idno type="wicri:Area/Main/Exploration">000342</idno>
<idno type="wicri:Area/France/Extraction">000096</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Interactive indexation and transcription of historical printed books</title>
<author>
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidère">Nicolas Sidère</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-204893" status="VALID">
<orgName>Laboratoire d'Informatique de l'Université de Tours</orgName>
<orgName type="acronym">LI</orgName>
<desc>
<address>
<addrLine>64, Avenue Jean Portalis, 37200 Tours</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.li.univ-tours.fr/</ref>
</desc>
<listRelation>
<relation active="#struct-300408" type="direct"></relation>
<relation name="EA6300" active="#struct-300298" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-300408" type="direct">
<org type="institution" xml:id="struct-300408" status="VALID">
<orgName>Polytech'Tours</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
<tutelle name="EA6300" active="#struct-300298" type="direct">
<org type="institution" xml:id="struct-300298" status="VALID">
<orgName>Université François Rabelais - Tours</orgName>
<desc>
<address>
<addrLine>60 rue du Plat d'Étain, 37020 Tours cedex 1 </addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.univ-tours.fr</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
<placeName>
<settlement type="city">Tours</settlement>
<region type="old region" nuts="2">Région Centre</region>
<region type="region" nuts="2">Centre-Val de Loire</region>
</placeName>
<orgName type="university">Université François-Rabelais de Tours</orgName>
<orgName type="institution" wicri:auto="newGroup">Centre Val de Loire Université</orgName>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This article describes the work performed in the Pattern Redundancy Analysis for Document Image Indexing and Transcription research project. The project focused on layout analysis, text/graphics separation, optical character recognition (OCR), and text transcription processes dedicated to old and precious books. The originality of this work relies on the analysis and exploitation of pattern redundancy in documents to enable the efficient indexing and quick transcription of books and the identification of typographic materials. For these purposes, we have developed two software packages. The first, AGORA, performs page layout analysis, text/graphics separation, and pattern (letterform) extraction simultaneously. These patterns are then processed to group similar patterns together in single clusters so that different letterforms of a book can be extracted and analysed to compute redundancy rates. This process allows a significant reduction of the number of letterforms to be recognized. Once the clustering of letterforms is done, a user may assign a label to each cluster using the second software, RETRO. Labels are then automatically assigned to each corresponding character to perform the text transcription of the whole book. Thus, if 90% of the letterforms are detected as redundant, only one character out of ten must be labelled by the user to transcribe the book. Moreover, this transcription method allows us to deal easily with the special characters that appear frequently in old books. It is also possible to use our clustering approach to extract and create new font packages from specific printing material (e.g. from rare books printed with particular types or woodblocks). These new font packages could be incorporated into the training step of optical fonts recognition methods to improve the recognition results of OCRs on rare or specific books. 
The identification of typographic materials could also be useful for the study of both the aesthetic (such as how the thickness and shape of printing types evolved from the 15th to the mid-16th century) and economic aspects of printing historically. Until the second half of the 16th century, for instance, printing types circulated among workshops, and printers frequently sold or lent types to their fellows.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Centre-Val de Loire</li>
<li>Région Centre</li>
</region>
<settlement>
<li>Tours</li>
</settlement>
<orgName>
<li>Centre Val de Loire Université</li>
<li>Université François-Rabelais de Tours</li>
</orgName>
</list>
<tree>
<country name="France">
<region name="Région Centre">
<name sortKey="Ramel, Jean Yves" sort="Ramel, Jean Yves" uniqKey="Ramel J" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
</region>
<name sortKey="Sidere, Nicolas" sort="Sidere, Nicolas" uniqKey="Sidere N" first="Nicolas" last="Sidère">Nicolas Sidère</name>
</country>
</tree>
</affiliations>
</record>
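
The record above uses the `wicri:` and `hal:` prefixes without declaring them, so a namespace-aware parser will reject it as written. The following sketch, using Python's standard `xml.etree.ElementTree`, shows one possible workaround: bind the prefixes on the root element before parsing, then read the title and author names. The namespace URIs, the function names, and the shortened `SAMPLE` record are placeholders chosen for illustration, not values defined by Wicri or HAL.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs: the record does not declare these prefixes,
# so the values below are assumptions made only to let the parser run.
NS_DECL = 'xmlns:wicri="urn:example:wicri" xmlns:hal="urn:example:hal"'


def parse_record(xml_text):
    """Parse a record after binding the undeclared wicri: and hal: prefixes."""
    patched = xml_text.replace("<record>", "<record " + NS_DECL + ">", 1)
    return ET.fromstring(patched)


def title_and_authors(root):
    """Return the title and the author names found in titleStmt."""
    title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = title_stmt.findtext("title")
    authors = [name.text for name in title_stmt.findall("./author/name")]
    return title, authors


# Shortened copy of the record, keeping only what the sketch needs.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Interactive indexation and transcription of historical printed books</title>
<author>
<name sortKey="Ramel, Jean Yves" first="Jean-Yves" last="Ramel">Jean-Yves Ramel</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" status="VALID"></hal:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Sidere, Nicolas" first="Nicolas" last="Sidère">Nicolas Sidère</name>
</author>
</titleStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

title, authors = title_and_authors(parse_record(SAMPLE))
print(title)
print(authors)
```

The `xml:` prefix used by `xml:lang` and `xml:id` is predefined in XML and needs no declaration. Patching the root tag as a string is a pragmatic shortcut; if the real namespace URIs are known, they should be used instead of the placeholders.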

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000096 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000096 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    France
   |étape=   Analysis
   |type=    RBID
   |clé=     Hal:hal-01026493
   |texte=   Interactive indexation and transcription of historical printed books
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024